* ============================================================================ *
* Author: Benjamin Rosche
* Date:   July 7, 2018
* Topic:  Longitudinal negative binomial regression
* ============================================================================ *		  

version 15 // indicates version
cls // clears backscroll display buffer
clear all // start from scratch
set more off // suppresses -more- prompts

* ============================================================================ *
* Load data
* ============================================================================ *

cd "C:\Users\benja\OneDrive - Cornell University\GitHub\sas\prog\analysis\data"
use liss-recoded.dta, clear

* Import multiply-imputed data in the ICE (flong) format and register imputed 
* variables 
mi import ice
mi register imputed simpc_hh hhsize hhurban income_hh income_hh2 age age2 ///
   female eduyr eduyr2 migrantd trust voted opcosts agreeablenessscore joyscore ///
   valscore burscore joymean valmean burmean joydev valdev burdev  
   
* Recode wave so that it starts at 1
mi xeq: replace wave = wave - 7
mi register regular wave

* Sample selection to calibrate model on waves 2008 - 2013
mi xeq: gen calibrate13 = 0
mi xeq: replace calibrate13 = 1 if wave <= 6
mi register regular calibrate13
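
* Quick check of the recode: a minimal sketch that tabulates wave against the
* calibration flag in the original data (m = 0). It assumes, as the comments
* in this file imply, that wave 1 is 2008, so that waves 1-6 should have
* calibrate13 == 1 and all later waves calibrate13 == 0.
mi xeq 0: tabulate wave calibrate13, missing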

* ============================================================================ *
* Prepare variables
* ============================================================================ * 

global sas joymean joydev valmean valdev burmean burdev
global sas_trait joymean valmean burmean
global cov female age eduyr migrantd hhtype_sel income_hh hhurban simpc_hh hhsize trust voted opcosts agreeablenessscore 
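
* Optional guard: a sketch that stops the do-file early if any variable named
* in the globals above does not exist in the data (e.g. after a typo in a list).
confirm variable $sas $cov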

* ============================================================================ *
* Descriptive statistics (commented out for now)
* ============================================================================ *

* Create the estimation-sample indicators finsamp1 and finsamp2 for the descriptives
mi estimate, cmdok dots esample(finsamp1): menbreg completed wave $sas $cov if calibrate13==1, exp(invited) vce(cluster id_hh) || id: // sample 2008-13
mi estimate, cmdok dots esample(finsamp2): menbreg completed wave $cov, exp(invited) vce(cluster id_hh) || id:                        // sample incl. 2014 and 15

/*

* Fraction missing 
sum frm frm2 

* Assert that there are no duplicates per imputation and wave (i.e. only one observation per id)
mi xeq: bysort wave id: gen N = _N
assert N == 1
drop N // no duplicate entries

* Sample size

** N individuals 
tab wave if _mi_m == 1 & finsamp1 == 1 
tab wave if _mi_m == 1 & finsamp2 == 1                                      
// Within each wave we count ids; as there are 5 imputations, we use only one of them.
// finsamp1 flags the observations of the calibration sample (2008-13);
// finsamp2 flags the observations of the sample that includes 2014 and 2015.
					   
** N households 		
bys _mi_m wave id_hh (id): gen nvals = (_n == 1)
bys _mi_m wave (id_hh id): regen nvals = total(nvals), replace	 				   
tab nvals if _mi_m == 1 & finsamp2 == 1

** N of waves per id 
bys _mi_m id (wave): gen nvals2 = _N
tab nvals2 if _mi_m == 1

* Descriptives of all variables	
sum completed invited wave female age eduyr migrantd hhtype_sel income_hh hhurban simpc_hh hhsize trust voted opcosts agreeablenessscore joymean valmean burmean joydev valdev burdev if _mi_m == 1 & finsamp2 == 1

tab wave, sum(invited) 
tab wave, sum(completed) 

*/

* ============================================================================ *
* Study 2: Calibrating the model
* ============================================================================ *
	
* Only wave (to calculate R2)
mi estimate, cmdok dots saving(miest0, replace): menbreg completed wave if calibrate13==1 & finsamp1==1, exp(invited) vce(cluster id_hh) || id:
mi estimate using miest0, eform

* Only covariates 
mi estimate, cmdok dots saving(miest1, replace): menbreg completed wave $cov if calibrate13==1 & finsamp1==1, exp(invited) vce(cluster id_hh) || id:
mi estimate using miest1, eform post

putexcel set ../output/nb, replace sheet("Calibration - Covariates")
ereturn display, eform(eform)
matrix a = r(table)'
putexcel A1 = "Coefficient"
putexcel B1 = "SE"
putexcel C1 = "t statistic"
putexcel D1 = "p value"
putexcel A2 = matrix(a)

* Only SAS
mi estimate, cmdok dots saving(miest2, replace): menbreg completed wave $sas if calibrate13==1 & finsamp1==1, exp(invited) vce(cluster id_hh) || id:
mi estimate using miest2, eform post

putexcel set ../output/nb, modify sheet("Calibration - SAS")
ereturn display, eform(eform)
matrix a = r(table)' 
putexcel A1 = "Coefficient"
putexcel B1 = "SE"
putexcel C1 = "t statistic"
putexcel D1 = "p value"
putexcel A2 = matrix(a)

* SAS + covariates 
mi estimate, cmdok dots saving(miest3, replace): menbreg completed wave $sas $cov if calibrate13==1 & finsamp1==1, exp(invited) vce(cluster id_hh) || id:
mi estimate using miest3, eform post

putexcel set ../output/nb, modify sheet("Calibration - SAS + covariates")
ereturn display, eform(eform)
matrix a = r(table)'
putexcel A1 = "Coefficient"
putexcel B1 = "SE"
putexcel C1 = "t statistic"
putexcel D1 = "p value"
putexcel A2 = matrix(a)
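
* The three export blocks above repeat the same steps. A minimal sketch of a
* helper that wraps them; the program name export_eform and its two arguments
* are hypothetical and not part of the original analysis. Defining the program
* does not change any stored results.
capture program drop export_eform
program define export_eform
	args sheetname mode // mode is either replace or modify
	putexcel set ../output/nb, `mode' sheet("`sheetname'")
	ereturn display, eform(eform)
	matrix a = r(table)' // one row per coefficient
	putexcel A1 = "Coefficient"
	putexcel B1 = "SE"
	putexcel C1 = "t statistic"
	putexcel D1 = "p value"
	putexcel A2 = matrix(a)
end
* Usage, after -mi estimate using miest3, eform post-:
*   export_eform "Calibration - SAS + covariates" modify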

** Wald tests that all covariates, and all SAS terms, are jointly 0
mi test $cov
mi test $sas

* ============================================================================ *
* Study 3: Forecasting nonresponse 
* ============================================================================ *

* CREATING SAS MEANSCORES (DIFFERENT TIME WINDOWS) =========================== *

// The SAS mean and deviation on the full sample, i.e. 2008-13, have already
// been created. Here, we create the SAS mean on the basis of shorter windows:

* 2008 
mi xeq: bysort id (wave): gen joy08 = joyscore
mi xeq: bysort id (wave): replace joy08 = . if wave > 1

mi xeq: bysort id (wave): gen val08 = valscore
mi xeq: bysort id (wave): replace val08 = . if wave > 1

mi xeq: bysort id (wave): gen bur08 = burscore 
mi xeq: bysort id (wave): replace bur08 = . if wave > 1

* 2008-10 
mi xeq: bysort id (wave): egen joy10 = mean(joyscore) if wave <= 3
mi xeq: bysort id (wave): egen val10 = mean(valscore) if wave <= 3
mi xeq: bysort id (wave): egen bur10 = mean(burscore) if wave <= 3

mi xeq: bysort id (wave): replace joy10 = . if wave > 3
mi xeq: bysort id (wave): replace val10 = . if wave > 3 
mi xeq: bysort id (wave): replace bur10 = . if wave > 3 
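
* Visual check of the window scores: a sketch that lists the first rows of the
* first imputation. joy08 should be filled only in wave 1, and joy10 should be
* constant within id over waves 1-3 and missing afterwards.
mi xeq 1: sort id wave ; list id wave joyscore joy08 joy10 in 1/10, sepby(id)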

* 2008-13 
*  Already available as $sas_trait (joymean valmean burmean)

* WE ALSO LIMIT THE INFORMATION OF THE COVARIATES THE SAME WAY =============== *

foreach cov of global cov {
	mi xeq: bysort id (wave): gen  `cov'08 = `cov'
	mi xeq: bysort id (wave): replace  `cov'08 = . if wave > 1 // covariates with information from 2008
	
	mi xeq: bysort id (wave): gen  `cov'10 = `cov'
	mi xeq: bysort id (wave): replace  `cov'10 = . if wave > 3 // covariates with information till 2010
	
	mi xeq: bysort id (wave): gen  `cov'13 = `cov'
	mi xeq: bysort id (wave): replace  `cov'13 = . if wave > 6 // covariates with information till 2013
}

* Collect the time-window variables in globals for use as predictors
global sas08 joy08 val08 bur08
global sas10 joy10 val10 bur10

global cov08 female08 age08 eduyr08 migrantd08 hhtype_sel08 income_hh08 hhurban08 simpc_hh08 hhsize08 trust08 voted08 opcosts08 agreeablenessscore08 
global cov10 female10 age10 eduyr10 migrantd10 hhtype_sel10 income_hh10 hhurban10 simpc_hh10 hhsize10 trust10 voted10 opcosts10 agreeablenessscore10 
global cov13 female13 age13 eduyr13 migrantd13 hhtype_sel13 income_hh13 hhurban13 simpc_hh13 hhsize13 trust13 voted13 opcosts13 agreeablenessscore13 
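
* The three covariate lists above can also be built from $cov in a loop, which
* keeps them in sync if $cov changes. A sketch; the loop names s and c are
* illustrative. It yields the same lists as the explicit definitions above.
foreach s in 08 10 13 {
	global cov`s' // start from an empty list
	foreach c of global cov {
		global cov`s' ${cov`s'} `c'`s'
	}
}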

* ============================================================================ *
* Save and load the data
* ============================================================================ *

save mi_data, replace
use mi_data, clear

* ============================================================================ *
* Model estimation
* ============================================================================ *

* M4: covariates + SAS (trait + state) (2008-13)
* Comment: I start with M4 because this model determines the ultimate sample size
mi estimate, cmdok dots esample(NBpredSamp) saving(m4, replace): menbreg completed wave $sas_trait joydev valdev burdev $cov13, exp(invited) vce(cluster id_hh) || id:
mi estimate using m4, eform post

* M1: covariates only (2013)
mi estimate, cmdok dots saving(m1, replace): menbreg completed wave $cov13 if NBpredSamp==1, exp(invited) vce(cluster id_hh) || id:
mi estimate using m1, eform post


* M2: SAS (trait) only

* 2008
mi estimate, cmdok dots saving(m2_1, replace): menbreg completed wave $sas08 if NBpredSamp==1, exp(invited) vce(cluster id_hh) || id:
mi estimate using m2_1, eform post

* 2008-10
mi estimate, cmdok dots saving(m2_2, replace): menbreg completed wave $sas10 if NBpredSamp==1, exp(invited) vce(cluster id_hh) || id:
mi estimate using m2_2, eform post

* 2008-13
mi estimate, cmdok dots saving(m2_3, replace): menbreg completed wave $sas_trait if NBpredSamp==1, exp(invited) vce(cluster id_hh) || id:
mi estimate using m2_3, eform post

* M3: covariates + SAS (trait)

* 2008-13
mi estimate, cmdok dots saving(m3, replace): menbreg completed wave $sas_trait $cov13 if NBpredSamp==1, exp(invited) vce(cluster id_hh) || id:
mi estimate using m3, eform post

* ============================================================================ *
* mi corr (user-written program) 
* See http://www.stata.com/statalist/archive/2010-07/msg01382.html 
* Using [mi xeq: corr x y] gives exactly the same results
* ============================================================================ *

cap program drop ecorr
program ecorr, eclass
	version 12
	syntax [varlist] [if] [in] [aw fw] [, * ]
	if (`"`weight'"'!="") {
		local wgt `weight'`exp'
	}
	marksample touse
	correlate `varlist' `if' `in' `wgt', `options'
	tempname b V
	mata: st_matrix("`b'", vech(st_matrix("r(C)"))')
	local p = colsof(`b')
	mat `V' = J(`p',`p',0)
	local cols: colnames `b'
	mat rownames `V' = `cols'
	eret post `b' `V' [`wgt'] , obs(`=r(N)') esample(`touse')
	eret local cmd ecorr
	eret local title "Lower-diagonal correlation matrix"
	eret local vars "`varlist'"

end

cap program drop micorr
program micorr, rclass
	tempname esthold
	_estimates hold `esthold', nullok restore
	qui mi estimate, cmdok: ecorr `0'
	tempname C_mi
	mata: st_matrix("`C_mi'", invvech(st_matrix("e(b_mi)")'))
	mat colnames `C_mi' = `e(vars)'
	mat rownames `C_mi' = `e(vars)'
	di
	di as txt "Multiple-imputation estimate of the correlation matrix"	
	di as txt "(obs=" string(e(N_mi),"%9.0g") ")"
	matlist `C_mi'
	return clear
	ret matrix C_mi = `C_mi'

end
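
* Usage sketch: micorr takes a varlist and returns the pooled correlation
* matrix in r(C_mi). The variables below are chosen only for illustration.
micorr joymean valmean burmean
matrix list r(C_mi)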
	
* ============================================================================ *
* Save the data
* ============================================================================ *

save mi_data, replace
	
* ============================================================================ *
* Part II: Predicting the response rate
* ============================================================================ *

* ============================================================================ *
* Observed response rates in 2014 and 2015
* ============================================================================ *

// completed/invited is a proportion, not a rate. To turn it into a rate, we
// multiply the proportion by the average number of invitations in that year.

qui sum invited if wave == 6
mi xeq: gen op_14 = (completed / invited) * `r(mean)' if wave == 6 // in 2014

qui sum invited if wave == 7
mi xeq: gen op_15 = (completed / invited) * `r(mean)' if wave == 7 // in 2015

mi xeq: bysort id: regen op_14 = max(op_14), replace
mi xeq: bysort id: regen op_15 = max(op_15), replace

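mi register regular op_14 op_15

* Save again so that op_14 and op_15 are still available when mi_data is
* reloaded for the 2015 prediction
save mi_data, replace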

* ============================================================================ *
* Model-predict response rate in 2014
* ============================================================================ *

// To predict the 2014 response rate, we have to pretend that it is 2014.
// Thus, every row gets wave = 6 and the number of invitations of that year.

// Fix invited first: once wave is overwritten, (wave == 6) is true everywhere
mi xeq: replace invited = (invited * (wave == 6))
mi xeq: bysort id (wave): regen invited = max(invited), replace
mi xeq: replace wave = 6

// Within each calculation base (e.g. 2008-10), we now take 
// - the last covariate information available, i.e. 2010
// - the trait facet of the SAS, which is the mean over 2008-10
// - the number of invitations in the future, i.e. from 2014
// - pretend that the wave is 2014
// Note that predict, xb returns the linear predictor, i.e. the log of the
// expected count, not the response rate itself

* M1: covariates only (2013)
mi predict pp_m1 using m1, xb storecompleted 
mi xeq: bysort id: regen pp_m1 = pp_m1[6], replace 

* M2: SAS (trait) only
* 2008
mi predict pp_m21 using m2_1, xb storecompleted  
mi xeq: bysort id: regen pp_m21 = pp_m21[1], replace
* 2008-10
mi predict pp_m22 using m2_2, xb storecompleted  
mi xeq: bysort id: regen pp_m22 = pp_m22[3], replace
* 2008-13
mi predict pp_m23 using m2_3, xb storecompleted  
mi xeq: bysort id: regen pp_m23 = pp_m23[6], replace

* M3: covariates + SAS (trait) (2008-13)
mi predict pp_m3 using m3, xb storecompleted
mi xeq: bysort id: regen pp_m3 = pp_m3[6], replace

* M4: covariates + SAS (trait + state) (2008-13)
mi predict pp_m4 using m4, xb storecompleted
mi xeq: bysort id: regen pp_m4 = pp_m4[6], replace

// Per respondent, we keep 1 observation
mi xeq: bysort id (wave): gen n = _n
mi xeq: keep if n == 1
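
* Sanity check (a sketch): after the reduction, id should uniquely identify
* the observations within every imputation; isid exits with an error otherwise.
mi xeq: isid id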

* ============================================================================ *
* Correlation between model-predicted and observed response rate
* ============================================================================ *

* M1: covariates only (2013)
micorr pp_m1 op_14

* M2: SAS (trait) only
* 2008
micorr pp_m21 op_14
* 2008-10
micorr pp_m22 op_14
* 2008-13
micorr pp_m23 op_14

* M3: covariates + SAS (trait) (2008-13)
micorr pp_m3 op_14

* M4: covariates + SAS (trait + state) (2008-13)
micorr pp_m4 op_14

* ============================================================================ *
* Model-predict response rate in 2015
* ============================================================================ *

use mi_data, clear

// Fix invited first: once wave is overwritten, (wave == 7) is true everywhere
mi xeq: replace invited = (invited * (wave == 7))
mi xeq: bysort id (wave): regen invited = max(invited), replace
mi xeq: replace wave = 7

* M1: covariates only (2013)
mi predict pp_m1 using m1, xb storecompleted 
mi xeq: bysort id: regen pp_m1 = pp_m1[6], replace 

* M2: SAS (trait) only
* 2008
mi predict pp_m21 using m2_1, xb storecompleted  
mi xeq: bysort id: regen pp_m21 = pp_m21[1], replace
* 2008-10
mi predict pp_m22 using m2_2, xb storecompleted  
mi xeq: bysort id: regen pp_m22 = pp_m22[3], replace
* 2008-13
mi predict pp_m23 using m2_3, xb storecompleted  
mi xeq: bysort id: regen pp_m23 = pp_m23[6], replace

* M3: covariates + SAS (trait) (2008-13)
mi predict pp_m3 using m3, xb storecompleted
mi xeq: bysort id: regen pp_m3 = pp_m3[6], replace

* M4: covariates + SAS (trait + state) (2008-13)
mi predict pp_m4 using m4, xb storecompleted
mi xeq: bysort id: regen pp_m4 = pp_m4[6], replace

// Per respondent, we keep 1 observation
mi xeq: bysort id (wave): gen n = _n
mi xeq: keep if n == 1

* ============================================================================ *
* Correlation between model-predicted and observed response rate
* ============================================================================ *

* M1: covariates only (2013)
micorr pp_m1 op_15

* M2: SAS (trait) only
* 2008
micorr pp_m21 op_15
* 2008-10
micorr pp_m22 op_15
* 2008-13
micorr pp_m23 op_15

* M3: covariates + SAS (trait) (2008-13)
micorr pp_m3 op_15

* M4: covariates + SAS (trait + state) (2008-13)
micorr pp_m4 op_15

* eof
